On Cross-Script Information Retrieval

Authors

Nada Naji

James Allan

Abstract

We address the problem of cross-script retrieval in the context of a microblog system such as Twitter. Specifically, we explore methods for using native Arabic script queries to retrieve Arabic tweets written in a Roman script known as Arabizi. For example, a query for “كتاب” would not match “kitab” even though an Arabic reader would see them as the same word. Moreover, because Arabizi text contains no Arabic script, automatic language identification methods fail to recognize it as Arabic and instead label it as English, Polish, or the like. We propose a cross-script retrieval system using automatic rule-based mapping and statistical selection of transliteration keywords. We show that our system can achieve effective cross-script retrieval with minimal knowledge of the target language and without relying on external translation or transliteration tools or lexica. With minimal human annotation, our technique can be applied to other languages, such as Hindi and Greek, that are commonly written in a Roman character set in a similar way.
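The abstract does not give the mapping rules or the selection statistic, so the following is only a minimal sketch of the general idea, not the authors' system. The mapping table `LETTER_MAP`, the optional short-vowel slots, and the frequency-based `select` function are all hypothetical: each Arabic letter is mapped to possible Roman renderings, unwritten short vowels are optionally inserted between consonants, and only candidates that actually occur in a (toy) corpus are kept as query keywords.

```python
from collections import Counter
from itertools import product

# Hypothetical, deliberately tiny mapping table (an assumption, not the paper's rules).
LETTER_MAP = {
    "\u0643": ["k"],        # kaf
    "\u062a": ["t"],        # ta
    "\u0628": ["b"],        # ba
    "\u0627": ["a", "e"],   # alif
}
LONG_VOWELS = {"\u0627", "\u0648", "\u064a"}  # alif, waw, ya
SHORT_VOWELS = ["", "a", "e", "i", "o", "u"]  # short vowels are usually unwritten


def candidates(word):
    """Generate Roman-script candidates for an Arabic-script word.

    Raises KeyError for letters missing from the toy LETTER_MAP.
    """
    slots = []
    for i, ch in enumerate(word):
        slots.append(LETTER_MAP[ch])
        nxt = word[i + 1] if i + 1 < len(word) else None
        # Between two consonants, allow an optional unwritten short vowel.
        if ch not in LONG_VOWELS and nxt is not None and nxt not in LONG_VOWELS:
            slots.append(SHORT_VOWELS)
    return {"".join(parts) for parts in product(*slots)}


def select(cands, corpus_tokens, min_count=1):
    """Keep candidates seen in the corpus, most frequent first (ties alphabetical)."""
    counts = Counter(t.lower() for t in corpus_tokens)
    kept = [c for c in cands if counts[c] >= min_count]
    return sorted(kept, key=lambda c: (-counts[c], c))


if __name__ == "__main__":
    query = "\u0643\u062a\u0627\u0628"  # the Arabic-script word rendered "kitab" in Arabizi
    toy_corpus = ["kitab", "ktab", "kitab", "hello"]
    print(select(candidates(query), toy_corpus))
```

Over-generating candidates and then filtering against corpus statistics keeps the rule table small, which fits the abstract's claim of needing only minimal knowledge of the target language; a real system would need a far richer table and a more careful selection criterion.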

Similar resources

Maryland at FIRE 2011: Retrieval of OCR'd Bengali

In this year's Forum for Information Retrieval Evaluation (FIRE), the University of Maryland participated in the Retrieval of Indic Script OCRed Text (RISOT) task to experiment with the retrieval of Bengali script OCR’d documents. The experiments focused on evaluating a retrieval strategy motivated by recent work on Cross-Language Information Retrieval (CLIR), but which makes use of OCR error m...

Punjabi Machine Transliteration

Machine transliteration transcribes a word written in one script into the script of another language with approximate phonetic equivalence. It is useful for machine translation, cross-lingual information retrieval, and multilingual text and speech processing. Punjabi Machine Transliteration (PMT) is a special case of machine transliteration and is a process of converting a word from Shahmukhi (based on Arabic s...

Using Transliteration of Proper Names from Arabic to Latin Script to Improve English-Arabic Word Alignment

Bilingual lexicons of proper names play a vital role in machine translation and cross-language information retrieval. Word alignment approaches are generally used to construct bilingual lexicons automatically from parallel corpora. Aligning proper names is particularly difficult when the source and target languages of the parallel corpus do not share the same script. We present in ...

NLP-NITMZ @ MSIR 2016 System for Code-Mixed Cross-Script Question Classification

This paper describes our approach to the Code-Mixed Cross-Script Question Classification task, which is Subtask 1 of MSIR 2016. MSIR is a Mixed Script Information Retrieval event held in conjunction with FIRE 2016, the 8th meeting of the Forum for Information Retrieval Evaluation. For this task, our team NLP-NITMZ submitted three system runs: i) using a direct feature set; ii) using dire...

Study of the impact of proper name transliteration on the performance of word alignment in French-Arabic parallel corpora (Etude de l'impact de la translittération de noms propres sur la qualité de l'alignement de mots à partir de corpus parallèles français-arabe) [in French]

Bilingual lexicons play a vital role in cross-language information retrieval and machine translation. The manual construction of these lexicons is often costly and time-consuming. Word alignment techniques are generally used to construct bilingual lexicons from parallel texts. Aligning single words and nominal syntagms from parallel texts is a relatively well-controlled task for languages using...

Publication date: 2016
